ABSTRACT

Applied Performance Tests (APT) are defined as instruments designed
to measure performance in an actual or simulated setting. They
require at least a close approximation of the setting (if not the
actual setting) to which the performance is expected to be
transferred. This paper outlines measurement problems and issues
that are unique to APT. It is argued that the problems and issues
that are widely discussed for criterion referenced tests are also
appropriate to APT. A brief history of APT is given. A listing of
reliability and validity problems unique to APT is presented and
discussed. Two additional measurement problem areas in APT are their
objectivity and the generalizability of their results. Other
measurement related considerations that may be regarded as problems
in APT include cost, difficulty of application and development, and
unavailability of norms for test interpretation. Finally, research
and development steps to address the shortcomings of APT in
elementary and secondary education are listed. (RC)

James R. Sanders
Western Michigan University

Applied Performance Tests (APT) have been defined by Sachse and
Sanders (1975) as "instruments designed to measure performance in an
actual or simulated setting." They are measurement devices that
require at least a close approximation of the setting (if not the
actual setting) to which the performance is expected to be
transferred.

The purpose of this paper is to outline measurement problems and
issues that are unique to APT. I would argue that the measurement
problems and issues that are widely discussed for criterion
referenced tests (e.g., Harris, Alkin, and Popham, 1974) are also
applicable to APT. In order to limit this discussion, and because
there are many fine discussions of problems and issues that APT holds
in common with other tests, I will concentrate on some of the more
salient measurement concerns that are unique to APT.

The uniqueness of APT is found in the high degree of realism built
into the test. (Realism, fidelity, and authenticity are used
interchangeably in describing the degree to which these tests reflect
real life situations that require the behaviors being measured,
following Sachse and Sanders [1975].) Both exercise stimulus and
response modes can serve as focal points for applied performance test
designation. Both the stimulus and the response can have either high
or low fidelity. If either has high authenticity, the instrument is
generally classified as APT. A figure reproduced from Sachse and
Sanders (1975) depicts the instruments that may be classified as APT,
where X's denote APT situations.

[Figure not reproduced.]

Examples of tests that would fall into these categories were provided
by Sachse and Sanders (1975) and are not reproduced here.

¹Comments prepared for a Symposium on Applied Performance Testing:
Research and Development Perspectives, held at the annual meeting of
the American Educational Research Association, San Francisco,
California, April 1976.

Tests may be classified in many different ways. For example, we might
classify them as measures of cognitive, affective, or psychomotor
behaviors. Or we might classify them in terms of maximum versus
typical performance, following Cronbach (1970). Attempts to classify
APT using these categories usually fail, however, indicating
theoretical inadequacy in such classification schemes. The reasons
for this failure stem from the nature of the performance being
observed using APT. The performance might be an emotional response to
some stimulus, or a psychomotor performance. It usually involves
knowledge about appropriate responses. In fact, the performances that
are typically recorded using APT involve a complex combination of
each type of behavior. In this sense, then, APT may be thought of as
molar instrumentation (not in the dental sense) rather than
instrumentation used to measure molecular or elemental behaviors. But
even this distinction breaks down in that molecular responses, if
they have high authenticity, could be measured by APT. The
psychological theory underlying the development and use of APT is not
well developed and leads us to problems of definition,
classification, and interpretation with APT. Although some would
argue that this is not a measurement problem or issue, it is
important to note.

Former History

Historically, APT has been a mainstay for military and occupational
testing for years. Reviews by Fitzpatrick and Morrison (1971) and
Panitz and Olivo (1970), added to the volume edited by Glaser (1962),
provide a fine overview of the development and use of APT.
Professional occupations, especially the medical arts, and business
and industry have a shorter, but productive, history. The field of
public elementary and secondary education has little history in the
use of APT, with interest just now developing in the areas of teacher
evaluation, measurement of student achievement, and teacher and
administrator training. The forms of APT that have been developed and
used in the military, in occupational examination agencies, in
medical centers, and in business and industry include the following:

Military APT       Occupational APT     Medical APT          Business & Industry APT

simulation         work products        simulation           simulation
gaming             on-the-job process   situational tests    gaming
situational tests    observation          including problem  situational tests
                                          solving tests      [illegible] tests

All forms of APT have been used by each, no doubt, but the forms
listed in each column appear to be those that have received the most
emphasis.

Considerable interest in forms of APT for use in elementary and
secondary achievement testing has appeared recently. In a search for
uses of APT in public school content areas, we found considerable
variation among content areas. Reading, mathematics, and physical
education at the secondary level included frequent use of APT. This
was considerably less true at the elementary level. Content areas
that appeared to be void of APT materials included the social
sciences (history, civics, psychology, philosophy, and economics),
the arts (drama, literature, and art and music forms), the physical
sciences (geology, geography, biology, chemistry, and physics) and,
surprisingly, the area of foreign language study. However, the fact
that formalized, widely available applied performance tests were not
found in many public school content areas does not mean that APT is
not used. Rather, performance measurement that does occur usually
takes place in an informal manner. The potential for developing
standard APT materials for public school use, in the forms listed
above, is great. It remains a challenge to those who develop
measurement devices to provide APT for use by educational
practitioners. An examination of measurement problems and issues
unique to APT should provide some guidance to this effort.



Measurement Problems and Issues

By identifying realism of stimulus and/or response as the unique
characteristic of APT, we have narrowed our discussion to measurement
problems and issues created by this requirement. At first glance it
is tempting to conclude that there are few measurement problems and
issues that are unique to APT, but further investigation suggests
otherwise.

Consider, first, the reliability of APT. Certainly we have the tools
to calculate reliabilities, depending on the form of APT being
developed:

1. For simulation, gaming, and situational tests where mechanical
   or paper and pencil responses are used, the KR-20, or, under
   specific conditions, the alternative ways we have for calculating
   reliability on paper and pencil tests are appropriate.

2. For rating or ranking work products, interjudge reliability,
   the coefficient of concordance, or nonparametric tests we
   have for ordinal data are sufficient.

3. For process observation, the same techniques we have developed
   for determining the reliability of the many classroom
   observation schedules that exist are appropriate.
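As a rough illustration of the first two tools named above, the following sketch computes a KR-20 coefficient for dichotomously scored items and Kendall's coefficient of concordance for judges' rankings. The function names `kr20` and `kendalls_w` and the sample data are hypothetical and are not taken from any APT instrument. The sketch assumes 0/1 item scores, the population variance of total scores, and rankings without ties.

```python
from statistics import pvariance


def kr20(responses):
    """KR-20 for a list of examinee rows, each a list of 0/1 item scores.

    Assumption: the population variance of total scores is used.
    """
    n_people = len(responses)
    n_items = len(responses[0])
    if n_items < 2:
        raise ValueError("KR-20 needs at least two items")
    totals = [sum(row) for row in responses]
    total_var = pvariance(totals)
    if total_var == 0:
        raise ValueError("total scores have no variance")
    # Sum of p*q over items, where p is the proportion scoring 1 on the item.
    pq_sum = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in responses) / n_people
        pq_sum += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - pq_sum / total_var)


def kendalls_w(rankings):
    """Kendall's coefficient of concordance W.

    rankings: one list per judge, each holding the ranks 1..n that the
    judge gave to the same n work products. Assumption: no tied ranks.
    """
    m = len(rankings)        # number of judges
    n = len(rankings[0])     # number of products ranked
    rank_sums = [sum(judge[i] for judge in rankings) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((r - mean_sum) ** 2 for r in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


if __name__ == "__main__":
    # Hypothetical data: four examinees, three pass/fail checklist items.
    items = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
    print("KR-20:", kr20(items))
    # Hypothetical data: three judges ranking the same three work products.
    judges = [[1, 2, 3], [1, 3, 2], [2, 1, 3]]
    print("Kendall's W:", kendalls_w(judges))
```

W runs from 0 (no agreement among judges) to 1 (identical rankings), so it gives a quick check on interjudge consistency before product ratings are interpreted.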

What problems can exist? A listing of reliability problems that are
unique to APT includes:

1. Control over the testing environment. As the realism of
   APT is increased, a greater number of extraneous variables
   are introduced into the test. Irrelevant, often random,
   cues on the stimulus presentation will certainly affect
   the examinee's response. Obstructions to the examinee
   in giving the response he would normally give will also
   affect his performance. Thus, testing under real conditions
   will frequently lead to measurements with low reliability.

2. Number of times one examinee may be tested. It has been
   suggested (e.g., Gagne, 1952) that repeated measurements
   on an individual, where several tasks of the same type are
   given, may serve to increase the reliability of APT. However,
   considering the cost of APT (time, facilities, risk,
   logistics), often only one trial is possible. The
   reliability of this one trial is usually low.

3. Problems with instrumentation. When hardware is being
   used to record examinee responses, reliability is usually
   not a problem. However, when human recorders are used,
   observer variation can adversely affect the reliability
   of the APT. Webb et al. (1967) have addressed this problem
   in detail.

4. Other variations in testing conditions. Conditions in the
   testing environment, as noted earlier, can affect the
   reliability of the measure. The standardization of APT
   administration will improve the reliability of the tests,
   but can also remove realism from the testing situation.
   Standardization of directions and administration time
   are two concerns that should be addressed. They can also
   affect the validity of the test. Added to this problem are
   variations due to time of day, month, or year and the
   psychological and physical state of the examinee. These, too,
   affect the reliability of the measurement, though much the
   same could be said about other tests as well.
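The second problem in the list above, the gain from repeated trials, can be sketched with the Spearman-Brown formula. This formula is brought in here only for illustration. It assumes the trials are parallel (equally reliable and measuring the same thing), which realistic APT tasks may not satisfy. The function names and the reliability values in the example are hypothetical.

```python
def spearman_brown(single_trial_reliability, n_trials):
    """Projected reliability of the mean of n parallel trials."""
    r = single_trial_reliability
    if not 0 < r < 1:
        raise ValueError("single-trial reliability must lie between 0 and 1")
    if n_trials < 1:
        raise ValueError("n_trials must be at least 1")
    return n_trials * r / (1 + (n_trials - 1) * r)


def trials_needed(single_trial_reliability, target_reliability):
    """Number of parallel trials (possibly fractional) needed to reach
    the target reliability; round up in practice."""
    r, t = single_trial_reliability, target_reliability
    if not (0 < r < 1 and 0 < t < 1):
        raise ValueError("reliabilities must lie between 0 and 1")
    return t * (1 - r) / (r * (1 - t))


if __name__ == "__main__":
    # Hypothetical case: one costly trial with a reliability of 0.40.
    for k in (1, 2, 4, 6):
        print(k, "trial(s):", round(spearman_brown(0.40, k), 3))
    print("trials for 0.80:", trials_needed(0.40, 0.80))
```

Under these assumptions a single trial with a reliability of 0.40 would have to be repeated about six times to reach 0.80, which shows why the cost of each trial weighs so heavily on the reliability of APT.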
Another consideration is the validity of APT. The criterion validity
of APT is important if such tests are to be used in drawing
conclusions about one's ability to perform certain valued tasks.
Smith (1975) provides a nice discussion of the criterion problem and
his discussion certainly applies to APT. Basically, the validity
problems and issues related to APT include:

1. Identification of the ultimate criterion task and
   demonstrating, empirically, a relationship between
   performance on an APT and performance on the criterion
   task. This is an easy-sounding undertaking that has proven
   to be quite difficult. Task analyses in the military,
   various occupations, and in the medical arts have proven
   to be productive and form a basis for many APT materials
   (e.g., Osborn, 1975, and the many available HumRRO
   publications). This process has proven to be much more
   difficult in developing APT materials for public school use,
   especially when affective performance is of interest. The
   intervening experiences of people throughout their school
   years and between the time they receive their secondary
   diplomas and are called upon to perform certain valued tasks
   are powerful. This problem is an important one to be dealt
   with in developing APT materials for public school use that
   do tell us something about ultimate criterion performance.

2. Control over the testing environment. The closer to reality
   APT moves, the higher the criterion validity of the
   measurement. However, as I noted earlier, reliability is
   usually lower under realistic conditions and, as we know, the
   reliability of the test does place limits on its criterion
   validity. As we gain control over the testing environment,
   the criterion validity is usually lowered, although the
   reliability of the APT is increased. This trade-off presents
   a tough problem to those wishing to develop APT for public
   school use. There is no good answer to the questions of
   how much control is appropriate or how far APT can be
   removed from reality and still have criterion validity.

3. Identifying effective stimuli in APT and determining
   valid scores. This problem is related to the first
   reliability problem discussed earlier. It is difficult
   to standardize test stimuli in many real-life situations
   and, hence, two different examinees may actually be
   performing different tasks within the same APT. For example,
   one examinee may receive a high score on an APT only
   because he undertook the easy elements of the total
   task performance while letting the difficult parts go.
   Another examinee may receive a low score because he
   undertook the tough parts first and failed. Standardization
   of testing conditions and scoring procedures presents a
   difficult measurement problem to those who wish to develop
   APT for public school use.
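The limit that reliability places on criterion validity, noted in the second problem above, can be made concrete with the classical test theory result that an observed validity coefficient cannot exceed the square root of the product of the test and criterion reliabilities. The sketch below assumes that classical model. The function name and the numbers are hypothetical.

```python
import math


def max_criterion_validity(test_reliability, criterion_reliability=1.0):
    """Upper bound on the observed test-criterion correlation under
    classical test theory: sqrt(r_xx * r_yy).

    criterion_reliability defaults to 1.0, i.e. a perfectly reliable
    criterion measure, which is an optimistic assumption.
    """
    for value in (test_reliability, criterion_reliability):
        if not 0 <= value <= 1:
            raise ValueError("reliabilities must lie between 0 and 1")
    return math.sqrt(test_reliability * criterion_reliability)


if __name__ == "__main__":
    # Hypothetical case: a realistic, loosely controlled APT with low
    # reliability compared with a tightly standardized version.
    for r_xx in (0.36, 0.49, 0.81):
        print("reliability", r_xx, "-> validity ceiling",
              round(max_criterion_validity(r_xx), 2))
```

The bound is only a ceiling. Tightening control may raise it while lowering the validity actually attained, which is the trade-off described above.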

Two additional measurement problem areas in APT are the objectivity
of such measures and the generalizability of their results. When
hardware is being used to record the performance of an examinee,
objectivity presents little problem. Certainly airline pilot
simulators that mechanically record the responses of examinees
provide little room to doubt the objectivity of recorded scores.
However, when observations, ratings or rankings of products, or other
human recording devices are used, the objectivity of the measurement
is a problem worthy of consideration, and safeguards against bias
need to be built into the data collection and scoring procedure.

Many of the comments I made earlier about reliability and validity
problems of APT relate to the problem of generalizability of results.
Standardization of testing conditions, criterion validity,
intervention of extraneous variables into either stimulus conditions
or responses, and scoring procedures all help determine whether an
examinee's reported performance can be generalized to other settings,
other persons, or other times. Because these are problem areas with
APT, one has to include the generalizability of APT scores as a
problem also. From generalizability theory we all know that no one
observation of performance can be considered representative of the
person's ability to perform. Measurement limitations of APT limit
generalizability of scores even further than the limitations of
commercially available achievement tests that schools now use.

A listing of other measurement related considerations that may be
regarded as problems in APT includes cost, difficulty of application
and development, and unavailability of norms for test interpretation.
I suspect others in the Symposium will be discussing these concerns,
so I leave it to them to elaborate.

Is APT to be avoided in elementary and secondary education because
of these shortcomings? I don't believe it should. In fact, I believe
there is a great amount of yet unrealized potential in APT for public
school use. APT, to be sure, is just one small part of the entire
testing spectrum used in our schools. It is not a panacea for testing
problems in education, nor is it a replacement for the many highly
developed technical tools now used. It is a way to get information
about the performance of people on certain valued tasks. At present,
this limited type of testing is underdeveloped and underused in
education, and I believe it is productive to examine ways that we can
address some of the shortcomings that I have mentioned.

I would suggest the following research and development steps to
address these shortcomings:

1. Curriculum and measurement specialists need to work
   together in identifying tasks that are important in their
   own right and those that are associated with valued task
   performance in later life. The focus of this inquiry
   should be on identifying those tasks that are within the
   scope of the public school curriculum.

2. Curriculum and measurement specialists need to work
   through a national association in task forces or funded
   projects to develop standard APT's that can be made
   available to schools nationwide. Technical manuals,
   developed to meet the AERA/APA/NCME Standards for
   Educational and Psychological Tests, should be produced for
   these tests. I would expect the measurement problems I have
   discussed to be addressed further by these projects.

3. Task analysis studies of valued adult performances
   need to be undertaken and the results linked to public
   school curriculum. It is important that the elemental
   tasks for later task performance are systematically covered
   in the school curriculum.

4. Task analysis studies of valued performance expected when
   students exit the public schools need to be undertaken and
   the results linked to the K-12 curriculum so, again,
   instruction on the elemental tasks is not left to chance.

5. Although the criterion validity of an APT is its most
   important characteristic, there is a need to examine methods
   of controlling testing conditions in order to improve the
   reliability of such measures, while at the same time
   maintaining high criterion validity.

6. There is a need to systematically study confounding factors
   that affect APT performance for each form of APT. A
   taxonomic description of such factors would lead us a long
   way toward improving the quality of APT materials.

7. There is a need to develop a theoretical foundation for
   APT. Ways of classifying APT materials do not exist,
   undoubtedly because of a lack of theoretical structure.
   Furthermore, it is unclear what different forms of APT
   measure (i.e., simulations, games, situational tests, process
   observations, work products). If they measure different
   constructs, or the same construct but at different levels
   of complexity, a theory should reflect this knowledge.
   Research into the factorial complexity of APT forms would
   also contribute to theory development.

8. There is a need for creative development of new forms
   of APT that may alleviate some of the measurement
   shortcomings that have been discussed. Educational measurement
   specialists funded to explore such creative alternatives
   would contribute new knowledge that would have immediate
   use for public school testing.
Applied Performance Testing has great appeal for measuring task
performance in the public schools. There is much work to be done to
refine the concept and improve on our techniques. I believe the
effort is worthwhile and expect to see comparatively great advances
in APT in the near future.